Encoding

  • Maps characters (text) to byte sequences and vice versa. Used for text fields in serialization.

Naming Confusion
  • When someone says "custom package encoding", they usually mean:

    • A framing protocol (how message start/end is delimited).

    • A custom serialization/deserialization strategy.

    • A binary or textual format for transmitting structures over the network.

  • Using "encoding" for package framing strategies is technically valid but ambiguous.

  • In networking, prefer a more specific term such as "framing", "serialization", or "wire format".

  • The word "encoding" itself isn’t wrong, but its meaning has to be read from the technical context.

  • In Odin, for example, JSON and CBOR are considered "encoding".

Text

UTF-8
  • Unicode Transformation Format – 8-bit

  • Size:

    • ASCII characters (0–127) use 1 byte

    • Non-ASCII characters use 2 to 4 bytes

    • For languages written mostly in non-ASCII characters (e.g., Chinese, Japanese), it can take more space than UTF-16: most CJK characters need 3 bytes in UTF-8 but 2 in UTF-16

  • Web standard (used by HTML, JSON, XML, etc.)

  • Backward compatible with ASCII; valid ASCII text is valid UTF-8

  • Serialization:

    • UTF-8 can be seen as a form of serialization for text: it turns a sequence of Unicode code points into a sequence of bytes
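
The size rules and the ASCII compatibility above can be checked with a short sketch. The sample characters are arbitrary picks, one for each UTF-8 length; nothing here is specific to any library beyond Python's built-in `str.encode`.

```python
# Byte length of single characters when encoded as UTF-8.
# Samples (chosen for illustration): U+0041, U+00E9, U+4E2D, U+1F600.
samples = ["A", "\u00e9", "\u4e2d", "\U0001f600"]

utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
for ch, n in utf8_lengths.items():
    print(f"U+{ord(ch):04X}: {n} byte(s)")

# ASCII compatibility: ASCII-only text yields the same bytes in both encodings.
ascii_text = "Hello, world"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)
```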

UTF-16
  • Size:

    • BMP characters (Basic Multilingual Plane, U+0000 to U+FFFF) use 2 bytes

    • Characters outside the BMP (e.g., emojis, historical scripts) use 4 bytes (surrogate pairs)

    • More compact than UTF-8 for text dominated by BMP characters above U+07FF (e.g., many Asian languages), which need 3 bytes in UTF-8

  • Widely used in some APIs and programming languages (e.g., Java, Windows, .NET)
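
A minimal sketch of the 2-byte versus 4-byte cases, using two arbitrary sample characters. The `utf-16-be` codec is chosen here only so that no byte-order mark is prepended to the output.

```python
bmp_char = "\u4e2d"         # U+4E2D, inside the BMP
astral_char = "\U0001f600"  # U+1F600, outside the BMP

# "utf-16-be" avoids the 2-byte BOM that the plain "utf-16" codec prepends.
bmp_bytes = bmp_char.encode("utf-16-be")
astral_bytes = astral_char.encode("utf-16-be")

print(len(bmp_bytes), bmp_bytes.hex())
print(len(astral_bytes), astral_bytes.hex())

# The 4 bytes are two 16-bit code units: a high and a low surrogate.
high = int.from_bytes(astral_bytes[:2], "big")
low = int.from_bytes(astral_bytes[2:], "big")
print(hex(high), hex(low))
```

The high surrogate falls in the range 0xD800–0xDBFF and the low surrogate in 0xDC00–0xDFFF, which is how a decoder recognizes a pair.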

UTF-32
  • Size: Every code point takes exactly 4 bytes, which makes indexing by code point a simple offset calculation, at the cost of the largest size for most text
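
To compare the size trade-offs of the three UTF encodings side by side, here is a small sketch. `encoded_sizes` is a hypothetical helper written for these notes, and the two sample strings are arbitrary; the little-endian codec variants are used so that no byte-order mark inflates the counts.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Return the encoded size in bytes for each UTF encoding (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }

english = "hello"
chinese = "\u4f60\u597d\u4e16\u754c"  # four BMP characters

print(encoded_sizes(english))
print(encoded_sizes(chinese))
```

For the ASCII-only string UTF-8 is the smallest; for the four Chinese characters UTF-16 is the smallest, and UTF-32 is the largest in both cases.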

ASCII
  • American Standard Code for Information Interchange

  • Legacy system compatibility: For old systems or devices that only support ASCII

  • Simple English text: When text contains only basic characters (A–Z letters, 0–9 digits, basic punctuation)

  • Simplicity: ASCII is a 7-bit code (128 values) that is stored as 1 byte per character in practice, simplifying processing in very basic systems
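
A quick way to see the ASCII limit in practice is sketched below. `is_ascii` is a hypothetical helper name, and the sample strings are arbitrary.

```python
def is_ascii(data: bytes) -> bool:
    # ASCII code points are 0-127, so every byte must be below 0x80.
    return all(b < 0x80 for b in data)

plain = "Plain text 123!".encode("utf-8")
accented = "caf\u00e9".encode("utf-8")  # contains U+00E9, which is not ASCII

print(is_ascii(plain))
print(is_ascii(accented))

# Encoding non-ASCII text with the "ascii" codec fails instead of losing data.
try:
    "caf\u00e9".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```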

Base64

  • Base64 is a way to represent arbitrary binary data using only printable ASCII characters.

  • It is not encryption or compression; it is just an encoding that lets binary data be stored safely in text formats.

  • It is called Base64 because the encoding uses a numeral system with 64 distinct symbols to represent data.

    • Each Base64 character encodes 6 bits.

    • So you need exactly 64 symbols ( 2^6 = 64 ) to represent every possible 6-bit value.

  • Base64 exists because many systems historically handled text only.

  • Raw binary can contain:

    • null bytes (0x00)

    • control characters

    • non-printable bytes

  • Converts binary → safe text using only:

    A–Z a–z 0–9 + /
    
  • Padding uses = but it is not one of the 64 symbols of the alphabet.
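
As a minimal sketch of the points above, the following uses Python's standard `base64` module on a few deliberately awkward bytes (the byte values are arbitrary picks) and checks that the output stays inside the 64-symbol alphabet plus `=`.

```python
import base64
import string

# The 64 symbols: A-Z, a-z, 0-9, +, /
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

# Null byte, control characters, and non-printable high bytes.
raw = bytes([0x00, 0x01, 0x1B, 0xFF, 0x80, 0x0A])
encoded = base64.b64encode(raw).decode("ascii")

print(len(ALPHABET))
print(encoded)

# Every output character is either in the alphabet or the padding symbol.
only_safe = all(c in ALPHABET or c == "=" for c in encoded)
decoded = base64.b64decode(encoded)
print(only_safe, decoded == raw)
```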

Size
  • Base64 increases size by about 33%.

  • 4 output bytes per 3 input bytes.

encoded_size ≈ ceil(input_size / 3) * 4
  • "Why on earth would you use an encoding that increases the size of the thing?"

    • Because Base64 solves transport and compatibility problems, not size efficiency. It is used when binary must safely travel through systems that are text-only or text-fragile.

    • Raw binary can break many pipelines due to:

      • null bytes (0x00)

      • control characters

      • encoding assumptions (UTF-8/UTF-16)

      • line-ending conversions

      • legacy text parsers

    • Historically (and still today), many formats and tools expect text, not arbitrary bytes.

    • Base64 guarantees the data contains only safe printable ASCII.
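
The size formula and the transport argument can both be sketched in a few lines. `padded_base64_size` is a hypothetical helper, and JSON is used only as one example of a text-only format; the assumption is padded Base64 with no line breaks, in which case the formula is exact.

```python
import base64
import json
import math

def padded_base64_size(input_size: int) -> int:
    """Size of padded Base64 output (no line breaks) for input_size bytes."""
    return math.ceil(input_size / 3) * 4

# Compare the formula with the real encoder for a few input sizes.
for n in (0, 1, 2, 3, 4, 1000):
    assert len(base64.b64encode(bytes(n))) == padded_base64_size(n)

# JSON strings cannot carry arbitrary bytes, but they can carry Base64 text.
payload = bytes(range(256))  # every possible byte value once
message = json.dumps({"data": base64.b64encode(payload).decode("ascii")})
restored = base64.b64decode(json.loads(message)["data"])
print(restored == payload)
```

For 1000 input bytes the formula gives 1336 output characters, which is where the figure of about 33% comes from.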

Core idea
  • Base64 works in 6-bit chunks.

  • Binary bytes are 8 bits each

  • Base64 symbols encode 6 bits each

  • So it repacks data

3 bytes (24 bits) → 4 Base64 characters
  • as

3 × 8 = 24 bits
4 × 6 = 24 bits
  • Example:

    • Input:

      "Man"
      
    • Write bytes:

      M = 01001101
      a = 01100001
      n = 01101110
      
      010011010110000101101110
      
    • Split into 6-bit groups:

      010011 010110 000101 101110
      
    • In decimals:

      19 22 5 46
      
    • Map to Base64 alphabet:

      0–25  β†’ A–Z
      26–51 β†’ a–z
      52–61 β†’ 0–9
      62    β†’ +
      63    β†’ /
      
    • Output:

      19 β†’ T
      22 β†’ W
      5  β†’ F
      46 β†’ u
      
    • Concatenated:

      TWFu
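
The steps of the "Man" example can be sketched directly in code. `encode_full_groups` is a hypothetical function written for these notes; it only handles input whose length is a multiple of 3, and the standard library is used purely as a cross-check.

```python
import base64
import string

# Index 0-25 -> A-Z, 26-51 -> a-z, 52-61 -> 0-9, 62 -> +, 63 -> /
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def encode_full_groups(data: bytes) -> str:
    """Encode input made of complete 3-byte groups (no padding handled)."""
    if len(data) % 3 != 0:
        raise ValueError("this sketch only handles complete 3-byte groups")
    out = []
    for i in range(0, len(data), 3):
        # Pack 3 bytes into one 24-bit integer.
        block = int.from_bytes(data[i:i + 3], "big")
        # Peel off four 6-bit indices, most significant first.
        for shift in (18, 12, 6, 0):
            out.append(ALPHABET[(block >> shift) & 0b111111])
    return "".join(out)

result = encode_full_groups(b"Man")
print(result)  # TWFu
print(result == base64.b64encode(b"Man").decode("ascii"))
```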
      
Padding rules
  • If input length is not divisible by 3, Base64 pads with =.

  • Example:

    • Input:

      "Ma"
      
    • Write binary:

      01001101 01100001
      
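    • Pad with zero bits up to 24 bits and split into 6-bit groups:

      010011 010110 000100 000000
      
    • The third group holds only 4 data bits (0001) plus 2 zero bits. The fourth group holds no data at all, so it is written as = rather than as A.

    • In decimals: 19 22 4, which map to T W E in the Base64 alphabet.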

    • Output:

      TWE=
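
One way to implement the padding rule is sketched below. `encode_with_padding` is a hypothetical function written for these notes, not the reference algorithm of any library; it fills the last group with zero bytes and then replaces the symbols that carry no data with `=`.

```python
import base64
import string

ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def encode_with_padding(data: bytes) -> str:
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        missing = 3 - len(chunk)  # 0, 1, or 2 bytes short of a full group
        # Fill the group with zero bytes so it is always 24 bits wide.
        block = int.from_bytes(chunk + b"\x00" * missing, "big")
        symbols = [ALPHABET[(block >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
        # Each missing byte turns one trailing symbol into "=".
        symbols[4 - missing:] = ["="] * missing
        out.extend(symbols)
    return "".join(out)

print(encode_with_padding(b"Ma"))  # TWE=
print(encode_with_padding(b"M"))   # TQ==
```

An input with one leftover byte gets two `=` characters, and an input with two leftover bytes gets one, so the output length is always a multiple of 4.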